Skip to content

Fix performance when checking dataset version file count#12494

Merged
landreev merged 10 commits into
developfrom
performance-fix-for-dataset-version-has-files
Jul 1, 2026
Merged

Fix performance when checking dataset version file count#12494
landreev merged 10 commits into
developfrom
performance-fix-for-dataset-version-has-files

Conversation

@stevenwinship

Copy link
Copy Markdown
Contributor

What this PR does / why we need it: Performance issues with sql call
#12284

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@stevenwinship stevenwinship self-assigned this Jun 29, 2026
@stevenwinship stevenwinship moved this to In Progress 💻 in IQSS Dataverse Project Jun 29, 2026
@stevenwinship stevenwinship added this to the 6.11 milestone Jun 29, 2026
@stevenwinship stevenwinship added Size: 10 A percentage of a sprint. 7 hours. Feature: Performance & Stability labels Jun 29, 2026
@coveralls

coveralls commented Jun 29, 2026

Copy link
Copy Markdown

Coverage Status

Coverage is 25.022%performance-fix-for-dataset-version-has-files into develop. No base build found for develop.

landreev
landreev previously approved these changes Jun 29, 2026

@landreev landreev left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; extra logic for guest user helps.

@github-project-automation github-project-automation Bot moved this from In Progress 💻 to Ready for QA ⏩ in IQSS Dataverse Project Jun 29, 2026
@github-actions

This comment has been minimized.

@stevenwinship stevenwinship removed their assignment Jun 29, 2026
@@ -257,15 +260,17 @@ public boolean canIssueDeleteDatasetCommand(DvObject dvo){
// PUBLISH DATASET
public boolean canIssuePublishDatasetCommand(DvObject dvo){

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are in the permissionsWrapper class - shouldn't the results also be cached for the life of the bean?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we cache the results? It would change when a file upload completes

@qqmyers qqmyers Jun 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If caching isn't appropriate, should the method be in the PermissionServiceBean instead? The *Wrapper classes are viewscoped and should just have cached info (probably not enforced). That would make the call accessable via API as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we want everything that is called from JSF rendered="#{...}" attributes cached (because these are evaluated 7 times). Probably a good idea even when there is no reason to expect the task to be expensive.

In this specific case, I am actually not sure off the top of my head about the "would change when a file upload completes"... PermissionsWrapper is @ViewScoped. Does that mean that it will be a new "View" when the page re-renders itself after an upload completes? Or the same one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be easy to test though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I'm not sure w/o digging - @viewScoped would stay as long as we're just doing ajax updates? We also cache in the DatasetPage at times - that would be a place where the cache could be emptied when a file upload completes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looked into it some more

  1. From a quick test, caching that info within a View does NOT prevent the page from properly updating itself once one or more files has been uploaded (that re-rendering apparently counts as a new View).

  2. Yes, we DO want to cache this (on top of caching the result of permissionService...canIssue(command) that we are already doing there). JSF is calling these 2 methods over and over incessantly. Having that NativeQuery executed every time feels wrong.

  3. I mentioned in the perf. issue that "The query count goes from 9K+ down to 813 ... This is ~100 more than it has been consistently between the last few releases, but ..." That was from a test of the first iteration of this PR, that was not checking for || u instanceof GuestUser. The extra 100 queries were solely from the native query and whatever it instantiated in the datasetVersionService.hasFiles() performed on the 10 underlying datasets. Re-testing the PR in its current state brought that query count down to 715. (it was 714 in 6.10.1).
    It will be worse for the dataset page and a logged in browser user.

... So yes, we want to cache.
We'll wrap it up tomorrow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this was my quick experiment last night: #12495 (this is a pr into this branch).
I am not positive if this is the best way of handling this.
I'm hoping to be able to focus on other work today.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment:
What's truly annoying, is that the only serious performance degradation this was causing was on the dataverse.xhtml page; where, from what I understand, the check on the number of files may not even be relevant. (the "can publish" check there seems to be a proxy for "can see [some] curation info" - ?)
In dataset.xhtml, where we do need to know whether the dataset has files or not, the original ...Version.getFileMetadatas().size() was NOT even a problem, I don't think - because the filemetadatas are already instantiated, via a monster findDeep() query.

... we should avoid dumping any excessive amounts of work into this, in other words.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(although permissionsWrapper.canIssuePublishDatasetCommand() is also called from FilePage.java; it'll definitely help to avoid unnecessarily instantiating all the files in the parent dataset in there as well)

@@ -257,15 +260,17 @@ public boolean canIssueDeleteDatasetCommand(DvObject dvo){
// PUBLISH DATASET
public boolean canIssuePublishDatasetCommand(DvObject dvo){
User u = session.getUser();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't getUser return a GuestUser? i.e. getAuthenticatedUser() is needed for the check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@github-actions

This comment has been minimized.

@landreev landreev dismissed their stale review June 29, 2026 21:29

(I meant to add that as a comment; it would be great if somebody else approved it)

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

Test Results

403 tests   388 ✅  33m 27s ⏱️
 55 suites   15 💤
 55 files      0 ❌

Results for commit 9fac0bb.

♻️ This comment has been updated with latest results.

@stevenwinship stevenwinship moved this from Ready for QA ⏩ to Ready for Review ⏩ in IQSS Dataverse Project Jun 29, 2026
@pdurbin pdurbin moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jun 30, 2026
@pdurbin pdurbin self-assigned this Jun 30, 2026

@pdurbin pdurbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Details in the review.

@stevenwinship stevenwinship force-pushed the performance-fix-for-dataset-version-has-files branch from 644795a to 1a19e93 Compare June 30, 2026 13:35
@github-actions

This comment has been minimized.

@stevenwinship stevenwinship requested a review from pdurbin June 30, 2026 18:51
<h:outputText value="#{bundle['embargoed']}" styleClass="label label-primary" rendered="#{SearchIncludeFragment.isActivelyEmbargoed(result)}"/>
<h:outputText value="#{bundle['retentionExpired']}" styleClass="label label-warning" rendered="#{SearchIncludeFragment.isRetentionExpired(result)}"/>
<h:outputText value="#{DatasetUtil:getLocaleCurationStatusLabelFromString(result.curationStatus)}" styleClass="label label-info curation-status" rendered="#{SearchIncludeFragment.canSeeCurationStatus(result.entityId)}"/>
<h:outputText value="#{DatasetUtil:getLocaleCurationStatusLabelFromString(result.curationStatus)}" styleClass="label label-info curation-status" rendered="#{SearchIncludeFragment.canSeeCurationStatus(result.entity)}"/>

@qqmyers qqmyers Jun 30, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entity is only non-null if retrieveEntities was set in the search call - see

if (retrieveEntities) {
solrSearchResult.setEntity(dvObjectService.findDvObject(entityid));
}
- I don't know if that's the case in all the uses of this fragment or not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a check for null entity

@landreev landreev Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had given up on ever being able to avoid instantiating that entity when running searches from within the dataverse page.
But yes, since that retrieveEntities argument exists, it is prudent to assume the entity can be null, as in the latest commit.

@stevenwinship stevenwinship Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fix also saves the entity in the result so it should only look it up once

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding a check in the render logic to only go into all of this if result.curationStatus is not null/blank. Hopefully relatively few datasets will have one (only draft, only if they curation Status labels are used, etc.).

In general, I'm not a big fan of hiding the logic to get an entity off in this corner case - if the entity at this call is null, code moved from after to before this call could suddenly fail with a null entity (or also have to implement a check). (Or, more fun, if you add the curationStatus null check, some unrelated code could succeed when there is a curationStatus and fail when there isn't.)

It may be out of scope for a quick fix (is a quick fix needed if you check for curationStatus being null?), but it might be better to only getEntity() when needed and drop the retrieveEntities flag which gets them for every result even if some results never use it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added check so the canSeeCurationStatus only gets called if there is a curation status.
reverted back to sending the dataset id and looking up each time. There were 5 calls to get the dataset but only for 1 search result entity id.

@pdurbin pdurbin moved this from Ready for Review ⏩ to In Progress 💻 in IQSS Dataverse Project Jun 30, 2026
@stevenwinship stevenwinship moved this from In Progress 💻 to Ready for Review ⏩ in IQSS Dataverse Project Jun 30, 2026
@stevenwinship stevenwinship removed their assignment Jun 30, 2026
@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@stevenwinship stevenwinship self-assigned this Jun 30, 2026
@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Progress 💻 in IQSS Dataverse Project Jun 30, 2026
@stevenwinship stevenwinship removed their assignment Jun 30, 2026
@stevenwinship stevenwinship moved this from In Progress 💻 to Ready for Review ⏩ in IQSS Dataverse Project Jun 30, 2026
@github-actions

This comment has been minimized.

<h:outputText value="#{bundle['embargoed']}" styleClass="label label-primary" rendered="#{SearchIncludeFragment.isActivelyEmbargoed(result)}"/>
<h:outputText value="#{bundle['retentionExpired']}" styleClass="label label-warning" rendered="#{SearchIncludeFragment.isRetentionExpired(result)}"/>
<h:outputText value="#{DatasetUtil:getLocaleCurationStatusLabelFromString(result.curationStatus)}" styleClass="label label-info curation-status" rendered="#{SearchIncludeFragment.canSeeCurationStatus(result.entityId)}"/>
<h:outputText value="#{DatasetUtil:getLocaleCurationStatusLabelFromString(result.curationStatus)}" styleClass="label label-info curation-status" rendered="#{!empty result.curationStatus and SearchIncludeFragment.canSeeCurationStatus(result.entityId)}"/>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally support adding the curationStatus check to the rendering logic.
... but I am confused as to why we are back to passing result.entityId, and looking up the dvobject in the SearchIncludeFragment.java method - ? ... even knowing that in most cases the lookups will be avoided by the curationStatus checks (although I am hearing we'll be enabling these curation labels at Harvard too).

For 6.11 I really wanted to suggest the opposite: keep passing the result, or result.entity, and simply return false if the entity is null. The entity should be there in normal operations. There are other methods in the fragment relying on result.entity.

... @qqmyers says he's ok either way via slack. I know it's already taken longer than necessary. But it just feels wrong seeing those lookups back in the code, after having seen the code that avoided them - no?

I support revisiting and cleaning up how we handle solr search results past 6.11 also.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Progress 💻 in IQSS Dataverse Project Jul 1, 2026
@stevenwinship stevenwinship self-assigned this Jul 1, 2026
@stevenwinship stevenwinship removed their assignment Jul 1, 2026
@stevenwinship stevenwinship moved this from In Progress 💻 to Ready for Review ⏩ in IQSS Dataverse Project Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:performance-fix-for-dataset-version-has-files
ghcr.io/gdcc/configbaker:performance-fix-for-dataset-version-has-files

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev landreev left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will re-test and merge this long-suffering thing momentarily.

@github-project-automation github-project-automation Bot moved this from Ready for Review ⏩ to Ready for QA ⏩ in IQSS Dataverse Project Jul 1, 2026
@landreev

landreev commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Merging; will add some numbers in the 6.11 perf. issue.

@landreev landreev merged commit 2f723c8 into develop Jul 1, 2026
21 of 22 checks passed
@github-project-automation github-project-automation Bot moved this from Ready for QA ⏩ to Merged 🚀 in IQSS Dataverse Project Jul 1, 2026
@stevenwinship stevenwinship deleted the performance-fix-for-dataset-version-has-files branch July 1, 2026 14:50
@pdurbin pdurbin moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature: Performance & Stability Size: 10 A percentage of a sprint. 7 hours.

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

5 participants